Table Scan Delete File Handling: Positional and Equality Delete Support #652

Open
wants to merge 11 commits into
base: main
from

Conversation

sdd
Contributor

@sdd sdd commented Sep 27, 2024

This PR adds support for handling both positional and equality delete files within table scans.

The approach taken is to include a list of delete file paths in every FileScanTask. At the moment it is assumed that this list refers to delete files that may apply to the data file in the scan, rather than having been filtered to only contain delete files that definitely do apply to this data file. Further optimisation of plan_files is expected in the future to ensure that this list is pre-filtered before being included in the FileScanTask.
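
To make the shape of this concrete, here is a minimal sketch, not the code in this PR. The names `DeleteKind`, `DeleteFileRef`, `ScanTaskSketch` and `plan_tasks` are hypothetical and exist only for illustration. It shows each task carrying the full, unfiltered list of candidate delete files, which is the behaviour described above before any pre-filtering in plan_files.

```rust
/// The two delete file types discussed in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DeleteKind {
    Positional,
    Equality,
}

/// Hypothetical reference to a delete file (illustrative fields only).
#[derive(Debug, Clone, PartialEq, Eq)]
struct DeleteFileRef {
    file_path: String,
    kind: DeleteKind,
}

/// Hypothetical, simplified stand-in for a file scan task.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ScanTaskSketch {
    data_file_path: String,
    /// Delete files that MAY apply to this data file (not yet filtered).
    deletes: Vec<DeleteFileRef>,
}

/// Attach every known delete file to every data file's task.
/// A later optimisation would filter this list per data file.
fn plan_tasks(data_files: &[&str], delete_files: &[DeleteFileRef]) -> Vec<ScanTaskSketch> {
    data_files
        .iter()
        .map(|path| ScanTaskSketch {
            data_file_path: path.to_string(),
            deletes: delete_files.to_vec(),
        })
        .collect()
}
```

The trade-off in this sketch is that a reader must tolerate delete files that turn out not to apply to its data file, in exchange for a simpler planning step.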

This PR has been refactored so that this work can be split into multiple PRs.

This PR now contains solely the scan plan changes.

@sdd sdd force-pushed the feature/table-scan-delete-file-handling branch 3 times, most recently from 0a64237 to 28021a4 Compare October 10, 2024 18:03
@sdd sdd marked this pull request as ready for review October 10, 2024 18:26
@sdd sdd changed the title WIP: Table Scan Delete File Handling Table Scan Delete File Handling: Positional Delete Support Oct 10, 2024
@sdd
Contributor Author

sdd commented Oct 10, 2024

@Xuanwo, @liurenjie1024: This is now ready to review, PTAL when you get a chance. Looking forward to your feedback 😁

@sdd sdd force-pushed the feature/table-scan-delete-file-handling branch from 2732a49 to 50f8a9e Compare October 24, 2024 17:46
@sdd sdd changed the title Table Scan Delete File Handling: Positional Delete Support Table Scan Delete File Handling: Positional and Equality Delete Support Oct 24, 2024
@sdd sdd force-pushed the feature/table-scan-delete-file-handling branch 3 times, most recently from cf8748a to 7a8d297 Compare October 28, 2024 07:28
@sdd
Contributor Author

sdd commented Oct 29, 2024

Hi @liurenjie1024 and @Xuanwo - would either of you be able to review this at some point please? I know it's a bit large, sorry. Thanks :-)

@liurenjie1024
Contributor

> Hi @liurenjie1024 and @Xuanwo - would either of you be able to review this at some point please? I know it's a bit large, sorry. Thanks :-)

Hi @sdd, thanks for your patience. I've already started reviewing it, but it's fairly large, so it may take some time.

@sdd sdd force-pushed the feature/table-scan-delete-file-handling branch from cc5dba4 to df4e86a Compare October 31, 2024 08:23
@sdd
Contributor Author

sdd commented Oct 31, 2024

Hey @liurenjie1024 - sorry to make changes whilst you are reviewing. I updated the design of the DeleteFileManager as I was not happy with it.

Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @sdd for your patience. It's a really large PR and took me some time to review. I think you've generally understood how delete files work, but I have some concerns about the current code, as it mixes a lot of things together. Deletion handling is quite challenging, and I think the design of the Java implementation is quite reasonable:

  1. DeleteFileIndex
  2. DeleteFilter
  3. GenericReader

I think we may need a design that splits this into smaller parts. What do you think?
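
As a rough illustration of the index part of such a split, here is a minimal sketch. It is not the Java implementation or this PR's code, and the names `DeleteEntry`, `DeleteFileIndexSketch` and `deletes_for` are hypothetical. It assumes the sequence-number rules from the Iceberg table spec: a positional delete applies when the data file's sequence number is less than or equal to the delete file's, and an equality delete applies only when it is strictly less. Partition matching is deliberately omitted.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DeleteKind {
    Positional,
    Equality,
}

/// Hypothetical index entry for one delete file.
#[derive(Debug, Clone, PartialEq, Eq)]
struct DeleteEntry {
    path: String,
    kind: DeleteKind,
    sequence_number: i64,
}

/// Hypothetical, simplified delete file index (no partition handling).
#[derive(Debug, Default)]
struct DeleteFileIndexSketch {
    entries: Vec<DeleteEntry>,
}

impl DeleteFileIndexSketch {
    fn insert(&mut self, entry: DeleteEntry) {
        self.entries.push(entry);
    }

    /// Return the delete files that apply to a data file with the given
    /// data sequence number, using only the sequence-number rules.
    fn deletes_for(&self, data_seq: i64) -> Vec<&DeleteEntry> {
        self.entries
            .iter()
            .filter(|d| match d.kind {
                // Positional deletes can target rows written in the same commit.
                DeleteKind::Positional => data_seq <= d.sequence_number,
                // Equality deletes only affect strictly older data files.
                DeleteKind::Equality => data_seq < d.sequence_number,
            })
            .collect()
    }
}
```

A filter and a reader would then consume the output of `deletes_for`, which keeps the lookup logic separate from the code that actually applies the deletes.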

crates/iceberg/src/spec/delete_file.rs (outdated, resolved)
crates/iceberg/src/scan.rs (outdated, resolved)

/// A task to scan part of file.
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct FileScanTaskDeleteFile {
Contributor

This may evolve as we add more features, so I would suggest making this a crate-only data structure.

Contributor Author

Are you sure? It is present inside FileScanTask, all of whose fields are pub, and which is intended for potential consumption outside of the crate.

crates/iceberg/src/spec/delete_file.rs (outdated, resolved)
crates/iceberg/src/spec/delete_file.rs (outdated, resolved)
crates/iceberg/src/arrow/reader.rs (outdated, resolved)
// that are not applicable to the DataFile?

DeleteFileManagerFuture {
files: self.files.clone(),
Contributor

This is incorrect even if we ignore pruning techniques to remove unrelated deletion files. Please see this part for details.

@sdd
Contributor Author

sdd commented Nov 12, 2024

Thanks so much for the review on this @liurenjie1024 - I've been ill for the past week or two so I've not had chance to work through your review in detail yet. I just wanted to let you know I've seen it and will pick it up when I've recovered. 👍

@liurenjie1024
Contributor

> Thanks so much for the review on this @liurenjie1024 - I've been ill for the past week or two so I've not had chance to work through your review in detail yet. I just wanted to let you know I've seen it and will pick it up when I've recovered. 👍

Hi @sdd, sorry to hear that, take care of yourself! Don't worry about this, I'll be happy to discuss it with you anytime once you're back.

@Fokko Fokko self-requested a review November 28, 2024 15:24
@sdd sdd force-pushed the feature/table-scan-delete-file-handling branch 2 times, most recently from 2ff526f to 091a249 Compare December 11, 2024 19:07
@Fokko
Contributor

Fokko commented Dec 19, 2024

@sdd Thanks for doing all this work, could you split out the positional deletes? I think that's already a sizeable chunk.

@sdd
Contributor Author

sdd commented Dec 19, 2024

Sure @Fokko - I'm in the middle of a refactor of what I have so far. It aligns the design a bit more closely to the Java DeleteFileIndex while still keeping the more efficient loading process from my original. I was thinking of splitting this PR into three - one that is mostly collating all the delete files into the index, and then two more that each focus on the filtering and application of the two delete types.
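
To illustrate why the two delete types could be split into separate PRs, here is a minimal sketch of how each might be applied to rows that are already in memory. It is not the code in this PR; the function names are hypothetical, row positions are assumed to be 0-based within the data file, and real readers would work on Arrow batches rather than slices.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Keep rows whose 0-based position is not listed in a positional delete.
fn apply_positional_deletes<T: Clone>(rows: &[T], deleted_positions: &HashSet<u64>) -> Vec<T> {
    rows.iter()
        .enumerate()
        .filter(|(pos, _)| !deleted_positions.contains(&(*pos as u64)))
        .map(|(_, row)| row.clone())
        .collect()
}

/// Keep rows whose key (as extracted by `key_of`) matches no deleted key.
fn apply_equality_deletes<T: Clone, K: Eq + Hash>(
    rows: &[T],
    key_of: impl Fn(&T) -> K,
    deleted_keys: &HashSet<K>,
) -> Vec<T> {
    rows.iter()
        .filter(|row| !deleted_keys.contains(&key_of(*row)))
        .cloned()
        .collect()
}
```

Positional deletes only need the row's offset in the file, while equality deletes need to read and compare column values, so the two paths share little beyond the lookup of which delete files apply.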

@Fokko
Contributor

Fokko commented Dec 20, 2024

@sdd Thank you for your understanding, looking forward to the smaller PRs 👍 From PyIceberg I've learned that there are a lot of subtle optimizations, and I want to make sure that we handle those correctly 👍

@sdd sdd force-pushed the feature/table-scan-delete-file-handling branch 2 times, most recently from 9e9ee48 to a7493f0 Compare December 23, 2024 11:49
@sdd
Contributor Author

sdd commented Dec 23, 2024

FAO @liurenjie1024, @Xuanwo, @Fokko:

I've finished refactoring this and after a few rounds I'm happier with the design of the DeleteFileIndex and how it is interacted with in the scan plan phase.

I will follow up once this is merged with another PR similar to @Xuanwo's recent one as I think we can use similar techniques in the scan plan.

I've got a few TODOs in here that mark behaviour I was unsure of and would like feedback on, as well as missing parts that will be addressed in follow-up PRs.

@sdd sdd force-pushed the feature/table-scan-delete-file-handling branch from d3e4d0d to 22ef190 Compare December 24, 2024 08:36